Accelerate Distributed Learning Related

Accelerate Distributed Learning: all about RDMA, ps-lite, distributed training, Parameter Server, and Ring All-reduce.

  • presenter: Jingrong Chen
  • time: 17.08.2018

What I learned

1. RDMA

2. IRN (Revisiting Network Support for RDMA)

3. RoCEv2 + PFC -> DCQCN

4. iWARP

5. Distributed Training

  • Data distributed (data parallelism: the training data is partitioned across workers)
  • Model distributed (model parallelism: the model is partitioned across workers)

6. Communication Structure

  • Parameter Server (e.g., TensorFlow)
  • Ring All-reduce

7. Parameter Server

Nature: a KVStore (a toy push/pull sketch follows the list)
Benefits:

  • Asynchronous updates
  • Makes fault tolerance easy
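
Below is a minimal, single-process sketch of the push/pull idea behind a parameter-server KVStore. The names (ToyKVStore, Push, Pull, lr) are illustrative only, not MXNet's or ps-lite's API; the point is that workers update and read parameters by key, and the server can apply pushes from different workers asynchronously.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy, single-process sketch of the KVStore abstraction behind a parameter
// server. Keys identify parameter shards; Push applies a gradient update,
// Pull reads the current value.
class ToyKVStore {
 public:
  void Push(uint64_t key, const std::vector<float>& grad, float lr) {
    auto& param = table_[key];
    if (param.size() < grad.size()) param.resize(grad.size(), 0.0f);
    // In a real server this update is applied per incoming request, so
    // workers never have to wait for each other (asynchronous update).
    for (size_t i = 0; i < grad.size(); ++i) param[i] -= lr * grad[i];
  }
  std::vector<float> Pull(uint64_t key) { return table_[key]; }

 private:
  std::unordered_map<uint64_t, std::vector<float>> table_;
};

int main() {
  ToyKVStore store;
  store.Push(/*key=*/0, /*grad=*/{0.5f, -0.25f}, /*lr=*/0.1f);
  std::vector<float> w = store.Pull(0);  // w == {-0.05, 0.025}
  return static_cast<int>(w.size()) - 2; // returns 0 on success
}
```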

8. Ring All-reduce

Benefits:

  • Bandwidth-optimal communication; no central parameter-server bottleneck

Disadvantages:

  • No fault tolerance
  • Not suitable for the cloud

Implementations (a toy single-process simulation of the algorithm follows the list):

  • TensorFlow + Uber Horovod
  • Baidu ring-allreduce (not available)
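
To make the algorithm concrete, here is a single-process toy simulation of ring all-reduce under assumed parameters (4 workers, 3 elements per chunk). It is not the Horovod or Baidu implementation, just the reduce-scatter + all-gather schedule that both follow.

```cpp
#include <cassert>
#include <vector>

// N workers sit on a ring and each local vector is split into N chunks.
// Phase 1 (reduce-scatter) circulates and accumulates chunks for N-1 steps;
// phase 2 (all-gather) circulates the fully reduced chunks for another N-1
// steps. Each worker sends/receives only 2*(N-1)/N of the data, instead of
// funnelling everything through one server.
int main() {
  const int N = 4;  // workers in the ring
  const int C = 3;  // elements per chunk
  std::vector<std::vector<float>> data(N, std::vector<float>(N * C));
  for (int w = 0; w < N; ++w)
    for (int i = 0; i < N * C; ++i) data[w][i] = float(w + 1);  // toy values

  auto at = [&](int c, int j) { return ((c % N + N) % N) * C + j; };

  // Phase 1: reduce-scatter. At step s, worker w-1 sends chunk (w-1-s) to
  // worker w, which accumulates it. The snapshot makes sends "simultaneous".
  for (int s = 0; s < N - 1; ++s) {
    auto snap = data;
    for (int w = 0; w < N; ++w) {
      int src = (w - 1 + N) % N, c = src - s;
      for (int j = 0; j < C; ++j) data[w][at(c, j)] += snap[src][at(c, j)];
    }
  }
  // Phase 2: all-gather. Worker w now owns the fully reduced chunk (w+1);
  // complete chunks are forwarded around the ring and copied (not added).
  for (int s = 0; s < N - 1; ++s) {
    auto snap = data;
    for (int w = 0; w < N; ++w) {
      int src = (w - 1 + N) % N, c = w - s;
      for (int j = 0; j < C; ++j) data[w][at(c, j)] = snap[src][at(c, j)];
    }
  }

  // Every worker should now hold the element-wise sum 1 + 2 + ... + N.
  for (int w = 0; w < N; ++w)
    for (int i = 0; i < N * C; ++i)
      assert(data[w][i] == float(N * (N + 1) / 2));
  return 0;
}
```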

9. ps-lite

MXNet and ps-lite are decoupled (a rough usage sketch follows the list), which means:

  • No memory management in ps-lite
  • No assumption on tensor size -> need rendezvous mode
  • 1 vs. N communication
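
A rough worker-side sketch of the ps-lite KV interface is shown below. Push/Pull/Wait are real ps-lite calls, but the exact Start/Finalize arguments and the KVWorker constructor differ between ps-lite versions, so treat this as illustrative rather than copy-paste ready.

```cpp
#include <vector>
#include "ps/ps.h"

int main(int argc, char* argv[]) {
  ps::Start(0);                                    // join the PS rendezvous
  ps::KVWorker<float> kv(0, 0);                    // app id 0, customer id 0

  std::vector<ps::Key> keys = {1, 3, 5};           // keys must be sorted
  std::vector<float> grads = {0.1f, 0.2f, 0.3f};   // one value per key here

  // Push is asynchronous and returns a timestamp; Wait blocks until the
  // request has been acknowledged by the server(s).
  int ts = kv.Push(keys, grads);
  kv.Wait(ts);

  // Pull the (updated) values back. ps-lite only moves opaque value blobs:
  // it does no memory management and assumes nothing about tensor sizes,
  // which is why large transfers need a rendezvous handshake underneath.
  std::vector<float> weights;
  kv.Wait(kv.Pull(keys, &weights));

  ps::Finalize(0, true);                           // barrier, then shut down
  return 0;
}
```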

10. Programming on Verbs

Memory must be registered before use -> manage memory manually
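
For example, a buffer has to go through ibv_reg_mr before it can be referenced by a work request, which is why the transport ends up pooling registered buffers instead of taking arbitrary pointers. A sketch (RegisterBuffer/ReleaseBuffer are made-up helper names; `pd` and proper error handling are assumed to exist elsewhere):

```cpp
#include <stdlib.h>
#include <infiniband/verbs.h>

// Every buffer touched by RDMA must be pinned and registered with the NIC.
struct RegisteredBuffer {
  void* addr;
  ibv_mr* mr;  // carries the lkey/rkey needed when posting work requests
};

RegisteredBuffer RegisterBuffer(ibv_pd* pd, size_t bytes) {
  void* addr = nullptr;
  if (posix_memalign(&addr, 4096, bytes) != 0)   // page-aligned helps pinning
    return {nullptr, nullptr};
  ibv_mr* mr = ibv_reg_mr(pd, addr, bytes,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                              IBV_ACCESS_REMOTE_WRITE);
  return {addr, mr};  // cache and reuse; registration itself is expensive
}

void ReleaseBuffer(const RegisteredBuffer& buf) {
  ibv_dereg_mr(buf.mr);  // deregister before freeing the memory
  free(buf.addr);
}
```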

Work completion handler cannot block the CQ polling thread -> thread pool / coroutine
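
A common pattern is to keep the polling thread doing nothing but ibv_poll_cq and hand completions off to a worker pool or coroutine scheduler. A rough sketch (PollLoop, the queue, and the stop flag are assumed scaffolding; the workers popping from `pending` are not shown):

```cpp
#include <infiniband/verbs.h>
#include <atomic>
#include <condition_variable>
#include <deque>
#include <mutex>

std::mutex mu;
std::condition_variable cv;
std::deque<ibv_wc> pending;  // completions waiting for a worker / coroutine

void PollLoop(ibv_cq* cq, const std::atomic<bool>& stop) {
  ibv_wc wc[32];
  while (!stop.load()) {
    int n = ibv_poll_cq(cq, 32, wc);  // non-blocking; 0 means the CQ is empty
    if (n <= 0) continue;
    {
      std::lock_guard<std::mutex> lk(mu);
      for (int i = 0; i < n; ++i) pending.push_back(wc[i]);
    }
    cv.notify_all();  // wake worker threads to run the actual handlers
  }
}
```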

The number of outstanding SRs cannot exceed the SQ size, and likewise the number of outstanding RRs on the remote side -> flow control
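
A minimal way to enforce this is a credit counter sized to the queue depth: posting a send consumes a credit, and a completion (or a credit-return message from the peer) gives one back. The class below is an illustrative sketch, not ps-lite's actual flow control:

```cpp
#include <atomic>

class SendCredits {
 public:
  explicit SendCredits(int depth) : credits_(depth) {}

  // Try to reserve a slot before calling ibv_post_send; on failure the
  // request is queued locally until a credit comes back.
  bool TryAcquire() {
    int c = credits_.load();
    while (c > 0) {
      if (credits_.compare_exchange_weak(c, c - 1)) return true;
    }
    return false;
  }

  // Called from the completion handler or on a credit-return message.
  void Release() { credits_.fetch_add(1); }

 private:
  std::atomic<int> credits_;
};
```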

Small and large messages -> Eager mode & Rendezvous mode
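
A sketch of the dispatch logic, with a made-up threshold and helper names (SendEager, SendRendezvous, kEagerLimit): small messages are copied into a pre-registered buffer and sent immediately, while large ones go through a rendezvous handshake so the payload can move zero-copy.

```cpp
#include <cstddef>
#include <cstdio>

constexpr size_t kEagerLimit = 8 * 1024;  // assumed cut-off, tuning-dependent

void SendEager(const void* /*data*/, size_t len) {
  // Small: copy into a pre-registered send buffer and post right away.
  std::printf("eager send: %zu bytes\n", len);
}

void SendRendezvous(const void* /*data*/, size_t len) {
  // Large: send a small control message first so both sides can set up
  // registered memory, then move the payload zero-copy via RDMA read/write.
  std::printf("rendezvous send: %zu bytes\n", len);
}

void SendMessage(const void* data, size_t len) {
  if (len <= kEagerLimit) SendEager(data, len);       // one trip, one extra copy
  else                    SendRendezvous(data, len);  // handshake, no extra copy
}
```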


For more


This is Yiqing Ma ‘s website.


If life deals you lemons, make lemonade….